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We describe a set of techniques for performing large scale ab initio calcula- 
tions using multigrid accelerations and a real-space grid as a basis. The multi- 
grid methods provide effective convergence acceleration and preconditioning 



. on all length scales, thereby permitting efficient calculations for ill-conditioned 

. systems with long length scales or high energy cut-offs. We discuss specific 

implementations of multigrid and real-space algorithms for electronic struc- 
ture calculations, including an efficient multigrid-accelerated solver for Kohn- 
Sham equations, compact yet accurate discretization schemes for the Kohn- 
Sham and Poisson equations, optimized pseudopotentials for real-space calcu- 
lations, efficacious computation of ionic forces, and a complex-wavefunction 
implementation for arbitrary sampling of the Brillioun zone. A particular 
strength of a real-space multigrid approach is its ready adaptability to mas- 
sively parallel computer architectures, and we present an implementation for 
the Cray-T3D with essentially linear scaling of the execution time with the 
number of processors. The method has been applied to a variety of periodic 
and non-periodic systems, including disordered Si, a N impurity in diamond, 
A1N in the wurtzite structure, and bulk Al. The high accuracy of the atomic 
forces allows for large step molecular dynamics; e.g., in a 1 ps simulation of Si 
at 1100 K with an ionic step of 80 a.u., the total energy was conserved within 
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27 /xeV per atom. 

PACS: 71.15Pd, 71.15.-m, 71.15Nc, 02.70Bf 
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I. INTRODUCTION 



Over the last several decades algorithmic advances coupled with the development of high 
speed supercomputers, have made ab initio quantum mechanical simulations possible for 
a wide range of physical systems. These methods have been used to provide a theoretical 
framework for interpreting experimental results and even to accurately predict the material 
properties before experimental data were available. However, the calculations are currently 
restricted to systems containing a few hundred atoms.0 This limitation is set by the available 
computer power, and the scaling of the computational work with the number of atoms. One 
of the most successful of the recent techniques is the Car-Parrinello method! in which the 
electronic orbitals are expanded in plane-wave basis functions and the resulting Hamiltonian 
is iteratively diagonalized. 

The practical and efficient extension of ab initio quantum methods to larger and more 
difficult systems may be accomplished by the refinement and improvement of traditional 
methods or by the development of new techniques. Although highly successful, traditional 
plane-wave methods encounter considerable difficulties when they are applied to physical 
systems with large length scales, or containing first-row or transition-metal atoms. These 
difficulties may be partially eliminated by the use of preconditioned conjugate- gradient 
techniques, Hi optimized pseudopotentials,ll~i augmented-wave methodsjl or plane waves in 
adaptive coordinates.0EOl However, these methods are still constrained by the use of a plane- 
wave basis set, and the necessity of performing Fast Fourier Transforms (FFT) between the 
real and reciprocal spaces. While FFT's may be implemented in a highly efficient manner on 
traditional vector supercomputers, the current trend in supercomputer design is massively 
parallel architectures. It is difficult to implement efficient FFT algorithms on such machines, 
due to the required long range communications. 

Real-space methods are inherently local and therefore do not lead to a large communi- 
cation overhead. The scaling of several critical parts of large calculations is improved from 
0(N 2 log N) in a plane-wave representation to 0(N 2 ), where N is the number of atoms. 



3 



Furthermore, preconditioning and convergence acceleration are most effectively carried out 
in real space (q.v. Section III). A real-space formulation is also required for efficient imple- 
mentations of O(N) electronic structure methods, in which the computational work required 
scales linearly with the number of atoms. These methods impose a localization constraint 
on the electronic orbitals@ or the electron charge density,0 which eliminates the 0(N 3 ) 
orthogonalization step. 

Orbital-based real-space approaches, e.g. atom-centered or floating gaussians, are very 
well established. Recently, however, there has been substantial interest in developing real- 
space orbital-independent methods, which permit systematic studies of convergence in the 
spirit of plane-wave methods. These methods include finite elements,^ grids,@~ill and 
waveletsSl 

The finite-element method was applied by White et alM to one-electron systems. They 
used both conjugate-gradient and multigrid acceleration^ to find the ground-state wave- 
function. Two of the present authors^! used a basis with a high density of grid points in 
the regions where the ions are located, and a lower density of points in the vacuum re- 
gions, in conjunction with multigrid acceleration, to calculate the electronic properties of 
atomic and diatomic systems. The core electrons were explicitly included and nearly sin- 
gular pseudopotentials were used. The non-uniform grid led to order of magnitude savings 
in the basis size and total computational effort. The multigrid iterations, which provide 
automatic preconditioning on all length scales, reduced the number of iterations needed to 
converge the electronic wavefunctions by an order of magnitude in these multi-length-scale 
systems. Weare et a/.El used a similar method to solve for the ground state of H^. Wavelet 
basesil'^i were used to solve the LDA equations for atoms and the O2 molecule. Chelikowsky 
et a/.0 have used high-order finite-difference methods and soft nonlocal pseudopotentials on 
uniform grids to calculate the electronic structure, geometry, and short-time dynamics of 
small Si clusters and of an isolated SO2 molecule. Beck et a/.0 have used uniform grids and 
a smeared nuclear potential to examine the energetics and structures of atoms and small 
molecules. They employed multigrid iteration techniques to improve the convergence rates 



of the Kohn-Sham functional. Real-space grids in curvilinear coordinates have been used 
by Gygi and GalliS to compute the properties of atoms and CO2, and Zumbach et 
tested it on O2. Seitsonen et a/.il used a uniform grid approach with pseudopotentials and 
a conjugate gradients scheme to calculate the electronic structure of P2, and to study a 
positron trapped by a Cd vacancy in CdTe. 

In a previous communicationS the present authors have outlined a multigrid-based ap- 
proach suitable for large scale calculations, together with a number of test applications. 
These included calculations for a vacancy in a 64-atom diamond supercell, an isolated Ceo 
molecule using non-periodic boundary conditions, a highly elongated diamond supercell, and 
a 32-atom supercell of GaN that included the Ga 3d electrons in valence. Uniformly spaced 
grids were used; in this case an effective "cut-off" may be defined, which is equal to that 
of the plane-wave calculation that uses the same real-space grid for the FFT's. Whenever 
feasible, the corresponding calculations were also carried out using plane-wave techniques, 
and the two sets of results were in excellent agreement with each other. This paper pro- 
vides a comprehensive description of the real-space multigrid method, and reports extensions 
to non-uniform grids, non-cubic grids, and to molecular dynamics simulations with highly 
accurate forces. 

Several computational issues absent from plane-wave and orbital-based methods arise 
when using a real-space grid approach. In the plane-wave basis the action of the kinetic 
energy operator on the basis functions can be computed exactly, and the wavef unctions, 
potentials, and the electron charge density can be trivially expanded in the basis. Every 
basis-set integral, except those involving the LDA exchange-correlation functional, can be 
computed exactly. The computational errors in the calculations are mainly due to the 
truncation of the basis. In a real-space grid implementation, the Kohn-Sham equations 
must be discretized explicitly, which presents important trade-offs between accuracy and 
computational efficiency. Furthermore, the quantum-mechanical operators are known only 
at a discrete set of grid points, which can introduce a spurious systematic dependence of 
the Kohn-Sham eigenvalues, the total energy, and the ionic forces on the relative position 
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of the atoms and the grid. We have developed a set of techniques that overcome these 
difficulties and have been used to compute accurate static and dynamical properties of large 
physical systems, while taking advantage of the rapid convergence rates afforded by multigrid 
methods. 

This paper is organized as follows: In Section II a method for the accurate and efficient 
real-space discretization of the Kohn-Sham equations for cubic, orthorhombic, and hexagonal 
symmetries is described. Section III focuses on the multigrid algorithms, which greatly 
accelerate convergence of the electronic wavefunctions and of the Hartree potential. Tests 
of the convergence acceleration are described in Section IV. The calculation of ionic forces 
that are sufficiently accurate for large step molecular dynamics requires special methods, 
which are described in Section V. Section VI discusses performance issues for massively 
parallel supercomputers, and describes a highly scalable and efficient implementation on the 
Cray-T3D, which has been tested on up to 512 processors. The summary in Section VII is 
followed by several technical appendices. 

II. GRID-BASED DISCRETIZATIONS OF THE KOHN-SHAM EQUATIONS 

Electronic structure calculations that use a real-space mesh to represent the wavefunc- 
tions, charge density, and ionic pseudopotentials must address a new set of technical diffi- 
culties when compared with plane-wave methods. In a plane wave representation the form 
of the kinetic energy operator is obvious. In contrast, the representation of the kinetic en- 
ergy operator on a real-space grid is approximated by some type of finite differencing, the 
accuracy of which must be carefully tested. Below, we describe real-space discretizations for 
uniform cubic, orthorhombic, and hexagonal grids, as well as nonuniform scaled cubic grids 
that increase the resolution locally. For uniform cubic grids we also describe and test the 
extension to periodic systems with arbitrary sampling of the Brillioun zone. 
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A. Uniform Cubic and Orthorhombic Grids 

Previously,!^ we described a real-space approach that uses uniform cubic grids with T 
point k-space sampling. In direct comparisons with plane-wave calculations, we found nearly 
perfect agreement between the two methods for several test systems. We now provide further 
details of our method. 

In real-space the wavef unctions, the electron charge density, and the potentials are di- 
rectly represented on a uniform three-dimensional real-space grid of N gri( i points with linear 
spacing h grid . The physical coordinates of each point are 

J, k) (i hg r id, j hg r id, k hgrid) 

i = l,---,N x ; j — 1,- • ■ , N y \ k = l,---,N z . (1) 

The ions are described by norm-conserving pseudopotentialsiHH in the Kleinman-Bylander 
nonlocal form.il These potentials are interpolated onto the grid from their radial represen- 
tation. Exchange and correlation effects are treated using the local density approximation 
(LDA) of density functional theory, in which the total electronic energy of a system of 
electrons and ions may be expressed as 

Eld A = fn£n + J drp(r){e XC [p(r)} ~ Vxc[p(r)}} 

n=l 

~- J drp(r) Vffartree(r) + E ion ^ ion . (2) 

The minimization of this functional requires the solution of the Kohn-Sham equations 

H K s[i>n\ = -^V 2 ^„, + V efS i) n = e„Vn, (3) 

subject to the orthonormality constraint on the eigenf unctions < ipi\ipj >= 5^. The accurate 
discretization of these equations on the grid structure described by Eq. (0) requires appro- 
priate numerical representations of the integral and differential operators. All integrations 
are performed using the three-dimensional trapezoidal rule: 

jdrf(r)^hl rid J2f(v(i,j,k)). (4) 

ijk 



We have found that for high accuracy it is essential that the integrand /(r) be band-limited 
in the sense that its Fourier transform should have minimal magnitude in the frequency 
range G > G max = ir/h gr id. This is explicit in a plane-wave calculation since the basis is 
cut-off at a specific plane-wave energy. 

The discrete real-space grid also provides a kinetic-energy cut-off of approximately 
Gf% nax /2. Unlike the plane-wave basis, high-frequency components above this cut-off can 
nonetheless manifest themselves on the grid. This high-frequency behavior, which can intro- 
duce unphysical variation in the total energy or the electron charge density, is perhaps best 
seen when the ions, and hence their pseudopotentials, shift relative to the grid points. If the 
pseudopotentials contain significant high-frequency components near or above G max then, 
as the ions shift, the high frequency components are aliased to lower frequency components 
in an unpredictable manner. 

This effect can be decreased by explicitly eliminating the high-frequency components in 
the pseudopotentials by Fourier filtering. In the context of plane-wave calculations, King- 
Smith et a/.0 recognized that the real-space integration of the nonlocal pseudopotentials 
could differ significantly from the exact result computed in momentum space, unless the po- 
tentials were modified so that Fourier components near G max were removed. Fourier filtering 
of the pseudopotentials is thus required in real-space calculations for accurate results. It is, 
of course, possible to use unfiltered potentials on real-space grids provided the grid spacing 
is sufficiently small, but our experience shows that the total energy and the electron charge 
density are often sufficiently well converged for significantly larger grid spacings — provided 
that explicit pseudopotential filtering is used. We use a somewhat different Fourier filtering 
method than that proposed by King-Smith et al, but it produces essentially the same effect 
(see Appendix B). 

The differential operator in the Kohn-Sham equations is approximated using a generalized 
eigenvalue form: 
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^mehrM + B mehr [V ef ^ n } - e n B mehr [^ n }, (5) 

where A me h r and Hmehr are the components of the Mehrstellen discretization,^ which is 
based on Hermite's generalization of Taylor's theorem. It uses a weighted sum of the wave- 
function and potential values to improve the accuracy of the discretization of the entire 
differential equation, not just the kinetic energy operator. In contrast to the central finite- 
differencing method, this discretization uses more local information (next-nearest neighbors, 
for example). The definition of the fourth-order Mehrstellen operator used in the present 
work is specified by the weights listed in Table |, which pertain to both cubic and orthorhom- 
bic grids (see below). A more detailed analysis of the Mehrstellen operator and Eq. @ is 
given in Appendix A. 

This representation of the Kohn-Sham Hamiltonian is short-ranged in real space in the 
sense that the operator can be applied to any orbital in 0(N gri( i) operations. Specifically, the 
application of the A me /j r operator at a point involves a sum over 19 orbital values while the 
application of the T5 me hr operator requires a sum over 7 points. The local potential multiplies 
the orbital pointwise, and the short-ranged nonlocal projectors require one integration over 
a fixed volume around each ion and a pointwise multiplication. This sparseness permits 
the use of iterative diagonalization techniques, and the short-ranged representation of the 
Hamiltonian leads to an efficient implementation on massively parallel computers. 

The discussion up to this point has been restricted to uniform cubic grids, but the 
extension to a general orthorhombic grid is straightforward. There are now three separate 
grid spacings h x , h y , and h z with the coordinates of each grid point given by: 

r(i,j, k) = (ih x ,jh y ,kh z ) 
i = 1, • • • , N x ; j = 1, - • • , N y ; k = 1, • ■ ■ , N z . (6) 

The orthorhombic Mehrstellen operator described in Table [I] is used to discretize the Kohn- 
Sham equations and numerical integration is performed according to 

Jdrf(r) = h x h y h z J2f[r^J,k)}. (7) 

ijk 
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B. Extension to Arbitrary Bloch Wave- Vectors 



In the preceding Section the wavefunctions were assumed to be real, with the Brillioun- 
zone sampling restricted to the T point. When these restrictions are lifted the Kohn-Sham 
equations become 

V eff (k)ip nk = e n ip nk , (8) 

where the ip nk are now the periodic parts of the Bloch functions. They are complex- valued, 
which presents no additional difficulties in discretization. The nonlocal projectors in V e ff{\s) 
have been multiplied by the phase factor e lk r . The gradient term Vipnk is computed using 
a central finite difference expression, which in one dimension has the form: 

4-ffa)= E a n f(x t+n ) + 0(h 7 ), (9) 

dx n=-Z 

where a x = 3/(4h grid ), a 2 = -3/(20h grid ), a 3 = l/(60h grid ), and a_ n = -a n . For a 
cubic grid structure the three dimensional generalization of this is the sum of the individual 
expressions for each coordinate axis. Denoting this finite-difference operator by V, the 
discretization of the Kohn-Sham equation becomes 

B mehr [ik ■ V^) n k + ^k 2 tjj nk + V eff (k)tp nk ] 

= £nBmehr[i>nk]i (10) 

where A me h r and B me /, r are again the components of the Mehrstellen operator. 

The accuracy of the discretization was tested by calculating the lattice constant and 
bulk modulus for an 8-atom Si supercell. A 26-Ry equivalent cut-off was used with k-space 
sampling restricted to the Baldereschi point.S The calculated lattice constant was 5.38 A 
with a bulk modulus of 0.922 Mbar. These are in good agreement with the corresponding 
plane-wave calculation with the cut-off of 26 Ry: 5.39 A and 0.960 Mbar, respectively.il 
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Bulk aluminum was selected as an additional test case. A 4-atom cell was used with 
a 23 Ry equivalent cut-off and k-space sampling of 35 special points in the irreducible 
part of the Brillouin zone. Since Al has partially occupied orbitals, the bands near the 
Fermi level were occupied using a Fermi-Dirac broadening function of width 0.1 eV. The 
convergence of the total energy with respect to energy cut-off and the number of k-points 
was tested by increasing the cut-off from 23 to 49 Ry, which produced a change of only 10 
meV/atom, and by increasing the number of k-points to 56, which changed the total energy 
by 3 meV/atom. The calculated lattice constant and bulk modulus of 4.02 A and 0.734 
Mbar are in excellent agreement with the experimental values of 4.02 A and 0.722 Mbar, 
and with previous theoretical results of Lam and CohenB of 4.01 A and 0.715 Mbar. 



C. Uniform Hexagonal Grids 

Unlike plane-wave methods, where different symmetry groups can be handled easily, an 
efficient real-space implementation for periodic systems with non-orthogonal lattice trans- 
lation vectors requires considerable modifications to the orthorhombic-symmetry implemen- 
tation. 

The hexagonal grid describing the unit cell or supercell is generated by 
r(i, j, k) = h xy i n x + h xy j n 2 + h z k n 3 

1 = 1,...,^; j = l,---,N y ; k = l,---,N z , (11) 

where the rii are the hexagonal Bravais-lattice vectors.il The c/a ratio can be chosen arbi- 
trarily by varying the two independent grid spacings, h xy and h z , and the number of grid 
points, N x = N y and N z . However, one should use h xy ~ h z in order to maintain similar 
resolution both in the xy plane and along the z-axis. Because the indexing of this hexagonal 
grid is isomorphic to the cubic one, the computer representation of potentials and wave- 
functions does not change. The most important difference is in the discretization of the 
Kohn-Sham equations. We have implemented a mixed sixth-order kinetic energy operator. 
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This discretization is described in Appendix C, as well as the modifications required in the 
multigrid restriction and interpolation procedures (q.v. Section III). 

The above implementation has been tested on a 32-atom supercell of A1N in the wurtzite 
phase. The accuracy of the results was confirmed by comparison with plane-wave Car- 
Parrinello calculations on the same supercell. Generalized norm-conserving&^il Al and N 
pseudopotentials were used for both calculations with k-space sampling restricted to the T 
point. The cohesive energy from the real-space calculation was 11.5 eV per A1N unit, which 
compares well with the value of 11.6 eV obtained in the Car-Parrinello calculations. The 
eigenvalue degeneracies were identical in both calculations and the maximum difference in 
any eigenvalue was 0.04 eV. 

The extension of the real-space grid representation to other Bravais lattices proceeds in 
a similar manner, the only requirement being the existence of an accurate finite difference 
discretization. 



D. Scaled Grids 

In real-space calculations it is possible to add resolution locally. This is especially valu- 
able for systems with a wide range of length scales such as surface or cluster calculations. 
A high density of grid points can be used near the ions, with a low density in the vacuum 
regions. Other possible applications are simulations of impurities in bulk materials, where 
the impurity ions may require higher resolutions to be accurately represented. By using 
locally enhanced regions the required resolution may be added only where needed, thereby 
greatly reducing the total number of points required. 

Local enhancement of the grid resolution may be achieved by adding small high-resolution 
gridsEl'El onto a uniform global grid, or by using a coordinate transform to warp the grid 
structure. Our focus here will be on the second approach, which was first proposed by 
Gygi0 for plane-wave basis sets and recently extended to real space by several workers.l^E 
In the real-space approach, a continuous coordinate transform is applied to a uniform grid. 
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In general the transformation is non-separable, but we prefer a separable coordinate trans- 
formation in order to avoid mixed derivatives in the kinetic-energy operator. As a test of 
the utility and accuracy of this scaled-grid approach, we examined an interstitial oxygen 
impurity in Si0 for two grid layouts: a dense uniform grid with a 76-Ry cut-off and a scaled 
grid with a cut-off that varied from 18 Ry to 76 Ry. The scaling-transformation that maps 
the fictitious computational grid x to the physical warped grid x sca i e d is 



where Xq is the x-coordinate of the oxygen atom, L x is the length of the supercell in the 
x-direction, and /3 is an adjustable parameter between zero and one that controls the degree 
of scaling. The y and z coordinates are scaled analogously. The scaled grid required four 
times fewer points than the uniform grid to achieve the same convergence of the total energy. 

Since the coordinate transforms are continuous functions, the integration weights and the 
coefficients of the discretized kinetic-energy operator may be generated from the uniform 
grid values using the metric tensor of the transform in the manner outlined by Gygi.BH^I 
With these modifications, the calculations proceed as for the uniform grids, but a sixth-order 
central finite-difference operator is used to discretize the first and second derivatives because 
we have not constructed a Mehrstellen operator for the scaled grid. In each case (uniform 
and scaled grids), a 64-atom supercell was used and the silicon atoms were fully relaxed. 
The oxygen atom was held fixed in order to avoid Pulay corrections!! to the ionic forces, 
which would have been required if it and the scaled grid were free to move. The uniform and 
scaled-grid calculations are in very good agreement: the maximum difference in Kohn-Sham 
eigenvalues was 40 meV and the maximum difference in ionic coordinates was 0.03 A. 



convergence by employing a sequence of grids of varying resolutions. The solution is ob- 
tained on a grid fine enough to accurately represent the pseudopotentials and the electronic 




(12) 



III. MULTIGRID ALGORITHMS 



To efficiently solve Eq. (|J), we have used multigrid-iteration techniques that accelerate 
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wavef unctions. If the solution error is expanded in a Fourier series, it may be shown that 
iterations on any given grid level will quickly reduce the components of the error with wave- 
lengths comparable to the grid spacing but are ineffective in reducing the components with 
wavelengths large relative to the grid spacing.Bil The solution is to treat the lower fre- 
quency components on a sequence of auxiliary grids with progressively larger grid spacings, 
where the remaining errors appear as high frequency components. This procedure provides 
excellent preconditioning for all length scales present in a system and leads to very rapid 
convergence rates. The operation count to converge one wavefunction with a fixed potential 
is 0(N gri( i), compared to 0{N gri d logA^ r i(i) for FFT-based approaches.!^ 

There is no one multigrid algorithm but rather a collection of algorithms that share 
certain common features. In order to describe the implementation used in this work, we 
start with a description of a multigrid solver for Poisson's equation. This will then be used 
as a building block for the more sophisticated algorithms actually employed. 

A standard numerical problem that illustrates the multigrid algorithm is the Poisson 
equation, — V 2 VH artree = 47rp, defined on a rectangular cell of dimensions (L x , L y ,L z ) with 
periodic boundary conditions. A standard method of solving for Vhartree is to discretize the 
equation on a uniform three dimensional grid with spacing h gr id and N gr id total points. The 
differential operator —V 2 is represented by some form of finite differencing. This produces 
a set of linear algebraic equations 

Ax=b, (13) 

where x and b are the discretized forms of Vnartree and 47rp, respectively, and A is the 
finite-difference representation of —V 2 . If the system is small, direct matrix methods are 
an acceptable means of solving the equations; however, for large systems the work required 
scales as (N gri d) 3 , which is prohibitive. An alternative approach is to use an iterative re- 
laxation scheme such as the Jacobi method.0 In this technique, the solution is iteratively 
improved. First, define x as an approximate solution of Eq. flT3|), and the residual r, a 
measure of the solution error, as 
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r = b - Ax. (14) 

An improved x is generated using 

x new = x + AtKr, (15) 

where At is a pseudo time step and K is the inverse of the diagonal component of A. 
This approach will always converge to a solution for some value of At if A is diagonally 
dominant. However, the number of iterations required to reduce the magnitude of the 

2/3 

residual to a specified accuracy is proportional to N grid) so that the algorithmic cost to 
converge, 0{N 5 Jf d ), is too great .§ While more sophisticated relaxation methods such as 
Gauss-Seidel, successive overrelaxation (SOR), or the alternating direction implicit method 
(ADI)@ have improved convergence rates, they still scale as (N gr id) a with a > 1, and are 
too slow for the grid sizes required in electronic structure calculations. 

The slow convergence of the Jacobi method can be qualitatively understood by noting 
that because A is a short-ranged operator, the updated approximate solution, Eq. (|I5D, is a 
linear combination of nearby values. If the error in the current estimate of x is decomposed 
into Fourier components, it can be shown that one Jacobi iteration considerably reduces the 
high frequency components of the error, but many Jacobi iterations are needed to affect the 
longest wavelength components of the error. The overall convergence rate is then limited 
by that of the lowest frequency components. As the problem becomes larger, the lowest 
frequency representable on the grid becomes smaller and the convergence rate decreases. 

The essence of the multigrid approach is the observation that the individual frequency 
components of the error are best reduced on a grid where the resolution is of the same order 
of magnitude as the wavelength of the error component. This approach will maintain a high 
convergence rate for all frequency components of the error even when the problem size (and 
the grid) becomes very large. 

We first describe a multigrid algorithm to solve Poisson's equation that uses two grids; 
a fine grid of spacing h and a coarser auxiliary one of spacing H. In this work, we use a 
coarse-to-fine grid ratio H/h of 2, but other ratios are possible. The solution is generated 
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as follows: the high-frequency components of the solution error with wavelength »s h are 
reduced by one or two Jacobi iterations. The residual r^, which should be devoid of high- 
frequency variation, is computed and transferred to the coarse grid by restriction (see below). 
Next, Poisson's equation on the coarse grid with the residual as a source term is solved by 



using the same iteration procedure as in Eq. (15), but with an initial estimate of zero. The 



Jacobi iteration on this level removes error components with wavelength rs H. Finally, the 
coarse grid solution is interpolated to the fine grid and added to the fine grid solution. This 
process is referred to as a coarse-grid-correction scheme (CGC). A few applications of the 
CGC cycle are generally sufficient to solve Poisson's equation to machine precision even for 
extremely large systems. 

An obvious question is how the solution is obtained on the coarse grid. If the total 
number of grid points in the coarse grid is small, a direct matrix method will be sufficient. 
If this number is so large that a direct method is impractical, then a second, coarser grid level 
is introduced and the two-grid algorithm is repeated in a recursive manner. When multiple 
grid levels are used, the pattern of cycling through the grids also needs to be considered. 
We use a simple progression from the finest to the coarsest grid level and then back to the 
finest level, which is referred to as a V-cycle. More complicated cycling schemes exist, but 
we have found that the V-cycle works as well in electronic structure calculations as the more 
sophisticated approaches. Another consequence of the multigrid approach is the reduction 
in the size of the grid (and consequently in the work required) on each level. For a uniform 
grid in three dimensions a doubling of h gr id with each level leads to a factor of eight reduction 
in Ngrid, so that the addition of extra coarse grid levels is computationally inexpensive. 

The simplest choice for the restriction operator is to copy every other point in the fine 
grid directly to the coarse grid. This so-called straight injection, while easy to implement, 
does not always yield good convergence rates. A better choice is a weighted restriction, in 
which each coarse- grid value is the average of the 27 fine-grid values surrounding it. In our 
work, the weight assigned to each fine-grid point is proportional to the volume it occupies 
at a given coarse-grid point. A good choice for interpolating from the coarse grid to the fine 
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grid is the adjoint of the restriction operator, which in this case becomes simple tri-linear 
interpolation.El 

The final accuracy of the solution is determined by the finite-difference representation of 
— V 2 on the finest grid level. It is neither necessary nor desirable to use the same represen- 
tation of —V 2 on all grid levels; i.e., may differ from A# in the form of the discretization 
as well as in the grid spacing. The technique of changing discretization on different levels is 
referred to as deferred defect correction (DDC) or double discretization.0i! It is especially 
valuable for problems where the operator used on the finest grid level is numerically unsta- 
ble on the coarser grids, or when it is inconvenient to apply. The accuracy of the fine-grid 
solution does not depend on the choice of this coarse-grid operator, which effects only the 
convergence rate. In solving Poisson's equation, we use the Mehrstellen operator on the 
finest grid level for high accuracy. However, it is unsuitable for convergence acceleration on 
the coarser grids because of stability problems. Thus, on the coarser grids a 7-point central 
finite-difference operator is used, which provides excellent stability and rapid high-frequency 
attenuation. 

The extension of multigrid concepts to the solution of the Kohn-Sham equations intro- 
duces several complications. First, the equations are non-linear since the eigenvalues and 
the orbitals must be computed simultaneously. Second, when the Kleinman-Bylander form 
of the nonlocal pseudopotentials is used, the equations become a set of integro-differential 
eigenvalue equations. Finally, the Hamiltonian depends upon the density, and must be 
solved self-consistently. 

Brandt et a/.@ have given a multigrid algorithm for the standard eigenvalue problem. 
Below we describe an alternative means of linearizing the eigenvalue equations. The multi- 
grid technique recommended in the literature for non-linear integro-differential equations is 
the full approximation storage (FAS) method.0 In FAS the entire problem is discretized 
and solved on all grid levels. In contrast, the CGC method outlined above generates the full 
solution only on the finest level. While the theoretical performance of FAS on this problem 
is superior, its implementation is significantly more complex. In addition, it is difficult to 
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obtain an accurate representation of the nonlocal pseudopotentials on the coarser grid lev- 
els. Furthermore, the Kohn-Sham equations need not be converged to maximum accuracy at 
every iteration, because the electron charge density (and therefore the Hamiltonian) change 
after each multigrid step. 

For these reasons, a modification of the double discretization approach described above 
was used. The discretized operator on the finest grid is the Mehrstellen approximation of 
the Kohn-Sham Hamiltonian, while a 7-point central finite-difference representation of —V 2 
alone is used on the coarse grids. Effectively, the coarse-grid equation for the wavefunction 
residual becomes Poisson's equation, where the source term is the residual generated by the 
Mehrstellen operator on the fine grid, Eq. (1P8|). 

Our multigrid procedure begins with the selection of some initial wavefunctions and 
electron charge density. We postpone discussion of initialization techniques until later and 
assume that an adequate start has been generated. The following steps are then performed 
for each individual wavefunction: First, an estimate of the eigenvalue is calculated from the 
Rayleigh quotient of the generalized eigenvalue equation, Eq. (|5|): 



In the case of complex orbitals, the eigenvalue is calculated using the Rayleigh quotient of 
the real (or imaginary) part of the generalized eigenvalue equation, Eq. fllQ|) : 



Next, several Jacobi iterations are applied to the orbital on the finest grid using Eqs. fll4|) 
and fljjD, where the residual is computed as 



The fictitious time step At used in the Jacobi iteration is typically chosen between 0.8 and 
1.4 a.u. In the case of complex orbitals, the real and imaginary components of the orbital 
are updated separately, using the appropriate generalization of Eq. fllSl) . 



(ip n \H mehr [jj n ]) 
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Next, the residual is restricted to the first coarse grid. A DDC coarse-grid cycle begins 
using the 7-point central finite-difference representation of —V 2 instead of H mehr . Several 
auxiliary coarse grids can be used. When the coarse-grid correction is interpolated onto the 
finest grid, only a fraction (3cgc of it is added to the orbital, for reasons of stability. A value 
°f Pcgc — 0.5 has been found to work for almost all systems. (Larger values may produce 
much higher convergence rates on some systems while being unstable for others, so some 
experimentation is necessary.) 

Before transferring the residual to the coarse grid, it is essential that enough Jacobi 
iterations be performed to eliminate the high frequency components from the residual. Since 
the residual is used as the right hand side of a CGC correction cycle, any high frequency 
components will eventually be transferred to a coarser grid where they cannot be represented 
correctly, greatly reducing the effectiveness of the multigrid cycle. In some cases they may 
even make the process numerically unstable. In the above approach, the difficulties of 
discretizing the nonlocal pseudopotentials on the coarse grid levels are avoided because the 
potential term is computed on the finest grid and frozen thereafter. 

The steps outlined above in the DDC apply only to a single wavefunction. The full 
solution process also requires the application of the orthonormality constraints and an update 
of the electron charge density. The full solution process (one SCF step) consists of the 
following cycle. 

First, the DDC is applied to all of the wavefunctions. Next, the orthonormality con- 
straints are applied using the Gram-Schmidt procedure: 



j<i 



1>i 



new 



i 
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states- 



(19) 



The new electron charge density is generated by linear mixing: 



N states 




(20) 
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where /j is the occupation of the i state and a is a mixing parameter, generally set 
to a value between 0.5 and 0.9. Next, the Hartree potential is recomputed for the new 
charge density using a Mehrstellen DDC cycle, and a new exchange-correlation potential is 
generated. 

Finally, a subspace diagonalization may be performed at this point. This need only be 
done occasionally (every 10-20 SCF steps is generally adequate) in order to unmix eigenstates 
that may be close in energy. Because the Mehrstellen Hamiltonian leads to a non-hermitian 
generalized eigenvalue equation (see Appendix A), subspace diagonalization requires a brief 
discussion: We look for a unitary transformation of the current wavefunctions that bet- 
ter represents the eigenvectors of the Hamiltonian, and are led to the following eigenvalue 
equation for the subspace: 

Yl H m b n d n,l = Q 5 ^n^n,Z, (21) 
n n 

where 

if™* = (^ m \n mehr ^ n }} (22) 
B^ h n = {^m\B mehr M), (23) 

th 

and d n j is the matrix of coefficients of the unitary transformation for the r state. Because 
B^ b n is invertible (see Appendix A), the subspace equations are equivalent to 

EO^ = e ^> ( 24 ) 

n 

where C sub = (B sub )~ 1 H sub . The matrix C sub is not hermitian except when the subspace is a 
subset of the space of eigenvectors. Thus, we do not diagonalize C sub because its eigenvectors 
are not necessarily orthogonal, which would spoil the orthogonality of the electronic orbitals. 
Instead, we discard the anti-hermitian part of C sub , which is smaller than the hermitian part 
of C sub by 0(h^ rid ), and diagonalize the hermitian part. This approximation works well in 
practice, and is exact at convergence. 
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The hermitian approximation does not affect the final accuracy of the solution because 
the multigrid-assisted Jacobi iterations ultimately converge the orbitals. Nonetheless, accu- 
rate subspace rotations are essential for good convergence: compare the convergence rates 
in Figs. [L]-0. As a test, we compared subspace diagonalizations with the Mehrstellen and 
the hermitian sixth-order discretizations, and found that the convergence is significantly 
improved with the former. 

The cycle described above is repeated until the electronic system converges to the desired 
tolerance, which may be monitored by computing the RMS value of the residual vector for 
each wavefunction (see Eq. fll8|)). When this reaches a value of 10 -9 a.u. for all wavefunctions 
in the occupied subspace, the convergence is sufficient for the computation of forces that are 
accurate enough for large step molecular dynamics with excellent energy conservation. 

As was mentioned previously, the convergence rates depend on the choice of the initial 
wavefunctions and electron charge density. A poor choice can lead to slow convergence rates 
or in some cases the system will not converge at all. Apart from random initial wavefunctions 
or an approximate solution that is generated using an LCAO basis set, one can also use a 
double-grid scheme. In the latter method the initial solutions are generated on a grid with 
a spacing twice as large as that used for the final grid. The computational work on this 
coarse grid is eight times smaller than what is needed on the fine grid. The approximate 
coarse-grids wavefunctions are then interpolated to the fine grid and used as the initial guess. 
This process can reduce the number of SCF cycles needed on the finest grid level by a factor 
of two to three, thereby achieving significant savings in the computational effort. 

IV. TESTS OF MULTIGRID CONVERGENCE ACCELERATION 

The theoretical convergence rates of multigrid methods may, in principle, be calculated 
exactly for certain types of problems. For well-behaved partial differential equations such 
as Poisson's equation discretized on N gri d points, 0(N gri( i) total operations are required to 
obtain a solution accurate to the the grid-truncation error. This compares well with FFT 
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based methods which require 0[N gri d log (N gri d)] operations. For the Kohn-Sham equations, 
an exact theoretical bound on multigrid convergence rates is difficult to obtain due to self- 
consistency effects, and to the best of our knowledge this analysis does not yet exist. We 
have therefore elected to study convergence properties in an empirical fashion by performing 
tests on physical systems typical of the problems normally examined with density functional 
theory. 

In previous work0 the present authors examined convergence rates for 8-atom supercells 
of perfect diamond as a function of the effective kinetic-energy cut-off determined by the grid 
resolution, for a 32-atom supercell of GaN that included the Ga 3d electrons in valence, and 
for a highly elongated 96-atom diamond supercell. It was found that multigrid convergence 
rates were largely independent of energy cut-off and cell geometries. While promising, 
these results were obtained for perfect crystal configurations of semiconductor compounds, 
which are generally fairly easy to converge. In this article we present the results of a more 
systematic study that includes disordered systems. 

The first system selected was a 64-atom supercell of bulk silicon. The ions were rep- 
resented by a generalized norm-conserving pseudopotentialil~il and the grid spacing used 
corresponded to an energy cut-off of 12 Ry. The ionic positions were generated by a molecu- 
lar dynamics simulation at a temperature of 1000 K. Because the work required to converge 
to the ground state depends on the quality of the initial wavefunctions and charge density, we 
used random initial wavefunctions and a constant initial electron charge density to minimize 
any possible bias from the choice of a starting configuration. A small number (10% of the 
total) of conduction band states was included in the calculations. Fig. p] shows the conver- 
gence rate defined as the log l0 (E — Eq) , plotted as a function of iteration number, where each 
iteration represents a single SCF step. Results are shown for calculations performed with 
and without multigrid acceleration, where the latter used a steepest-descents algorithm. In 
addition, the two calculations were repeated with, and without subspace diagonalizations 
of the orbitals. For the calculations that included subspace diagonalizations, the procedure 
was applied every 8 SCF steps, which led to small discontinuities in the smooth evolution 
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of the total energy. The results show that maximum convergence rates are obtained when 
multigrid iterations are combined with subspace diagonalization. The slowest convergence 
occurs for steepest descents with no subspace diagonalization. For the two runs where sub- 
space diagonalizations were performed, the multigrid run converged at roughly 2.5 times the 
rate of the steepest descents approach. 

While these results are encouraging, bulk silicon is a relatively straight-forward test 
and is well handled by standard plane-wave methods. As an example of a more difficult 
system we have considered a 64-atom diamond supercell with a substitutional nitrogen 
impurity. Standard pseudopotentials&~il were used for both C and N. The strong N p 
potential required an energy cut-off of 63 Ry. The presence of a localized nitrogen donor 
level together with the 63 Ry cut-off makes the system more difficult to converge. Random 
initial wavefunctions were used, and Fig. |2] shows the observed convergence rates. The 
convergence rates for the two runs that use subspace diagonalization are a factor of 4 better 
for multigrid than for steepest descents. This relative improvement is considerably greater 
than that observed for the silicon cell and is the consequence of the automatic preconditioning 
provided by multigrid techniques for all of the length and energy scales present in the 
problem. The multigrid convergence rates are largely independent of the grid spacing, which 
roughly corresponds to the kinetic energy cut-off in plane-wave approaches. This is not true 
of the steepest descents algorithm, where the maximum stable time step that may be used 
decreases as the energy cut-off increases. 

When comparing the convergence rates of the multigrid and steepest descents approaches, 
the computational workload involved in each technique must also be considered. A particular 
advantage of multigrid methods, when compared to other convergence acceleration schemes, 
is their low computational cost. This is due to the factor of 8 reduction in the number of 
grid points on each successive multigrid level. The computational time per SCF step in the 
silicon and diamond runs described above increased by less than 10% when multigrid was 
used instead of steepest descents. For bigger systems, where the costs of orthogonalizing 
the orbitals and applying the nonlocal pseudopotentials begin to dominate the total compu- 
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tational time, the extra work needed for the multigrid accelerations becomes negligible. In 
terms of computational time, the 64-atom Si supercell described above required 1.6 seconds 
per SCF step on 64 processors of a Cray-T3D. 

V. IONIC FORCES AND MOLECULAR DYNAMICS 

Efficient structure optimizations and the calculation of dynamical quantities such 
as phonon frequencies and thermodynamic properties require accurate ionic forces. In 
plane-wave methods the ionic forces are computed by applying the Hellmann-Feynman 
theorem.El'El Since the derivative of the pseudopotentials may be expressed exactly within 
the plane-wave basis, the accuracy of the ionic forces is limited only by machine precision 
and the degree of convergence to the Born-Oppenheimer surface. 

For the grid-based approach the accuracy of Hellmann-Feynman forces is limited by the 
numerical error in computing the integrals of the derivatives of the pseudopotentials. This 
error decreases with grid spacing. The differentiation of the radial potentials and projectors 
must be performed with care to include the effects of the Fourier filtering. Alternatively, a 
derivative-free implementation of the Hellmann-Feynman forces can be used, which we term 
virtual displacements. In this scheme, the ionic pseudopotentials are numerically differenti- 
ated directly on the real-space grid. The ions are moved through a set of small displacements, 
while the electron charge density and the wavefunctions are held fixed. The potential energy 
is calculated for each displacement and finite-differenced to form the derivative. The forces 
computed by the two methods agree well. Most of the forces we have calculated to date 
have been computed using virtual displacements. 

A stringent test of the accuracy of the ionic forces is a constant-energy molecular dy- 
namics simulation. Over the course of the simulation any systematic errors in the forces 
will manifest themselves as poor energy conservation. A distinction has to be made between 
small random errors that appear as bounded oscillations in the total energy and errors that 
increase in magnitude with simulation time. The small random errors are expected in the 
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real-space approach because the energy of an ion varies by a small amount as its position 
changes relative to the grid points.0 This is of no particular concern as long as the mag- 
nitude of the variation is small and oscillatory in nature. Of greater concern are errors 
that are unbounded. These could arise from errors in the forces, errors in integrating the 
equations of motion of the ions, and lack of self-consistency due to inadequate convergence 
of the electronic wavef unctions. The first source of error was minimized by Fourier filtering 
of the ionic pseudopotentials. The second is generally not a problem unless the ionic time 
step is too large. For small time steps even a simple integrator such as the Verlet algorithm 
is sufficient, and larger time steps may be handled by using higher order integrators, such 
as the Beeman- Verlet method.0 The last source of error is the most significant because 
Hellmann-Feynman forces are only accurate to first order in the error of the wavef unctions. 
A high degree of self-consistency is thus necessary to obtain good energy conservation. 

A 64-atom silicon supercell was selected to test energy conservation on a typical system. 
The ions were given random initial displacements from the perfect crystal configuration, 
and several velocity rescaling steps were performed in order to attain an average ionic tem- 
perature of 1100 K. A constant-energy molecular-dynamics simulation over 1 ps was then 
carried out, using 80 a.u. time steps and third-order Beeman- VerletS integration of the ionic 
equations of motion. The potential, kinetic, and total energy of the system vs. simulation 
time are plotted in Fig. [| We observed good energy conservation: the maximum variation 
in the total energy was 1.75 meV, which corresponds to 27 /xeV per atom. 

VI. MASSIVELY PARALLEL IMPLEMENTATION 

The performance of a given algorithm when solving complicated problems depends not 
only on the theoretical efficiency, which may be quite high, but also on how adaptable the 
algorithm is to modern computer architectures. One example are certain classical molecular 
dynamics algorithms, which perform only slightly better on vector supercomputers than on 
low cost engineering workstations, even though the supercomputer's theoretical peak per- 
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formance may be an order of magnitude larger. A particular strength of the Car-Parrinello 
method has been its efficient implementation on vector supercomputers, such as the Cray- 
YMP. However, vector performance, while improving steadily, is unlikely to increase by 
several orders of magnitude per decade as has occurred in the past. At the same time, the 
development of powerful, low cost microprocessors and memory, has led to massively par- 
allel architectures consisting of a large number of microprocessors, linked by a high speed 
communication network. Although efficient implementations of plane-wave-based methods 
on massively parallel architectures exist, the FFT-based algorithms do not scale well with 
the number of processors because the FFT is a global operation. 

Below, we will describe a massively parallel implementation of the multigrid method. 
Although some of the code-optimization issues are architecture-specific, most are generic and 
thus applicable to any massively parallel computation. The target machine is the Cray-T3D, 
which uses up to 2048 DEC- Alpha microprocessors, each with a peak performance of 150 
MFlops. Each processor has 8 KB direct-mapped data and instruction caches and 8 MW of 
local memory. The processors are linked together in a three-dimensional torus arrangement 
for data communication. Three issues have to be addressed in order to write an efficient code 
for this type of machine: minimizing communication costs between processors, balancing the 
work-load on each processor, and code optimization on the individual processors. 

A. Data Decomposition and Load Balancing 

The majority of the data storage in the multigrid method consists of the wavefunction 
values on the real-space grid. We will consider the case where the points are distributed 
on a uniform three-dimensional rectangular grid. If N w f is the total number of wavefunc- 
tions, then N grid N w f total storage is required. The simplest possible decomposition of data 
is to store complete wavefunctions on each processing element (PE), where each PE stores 
N w f/N PE orbitals. While conceptually simple, this approach will perform poorly for large 
systems with many wavefunctions, because orthogonalizing wavefunctions residing on dif- 
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ferent PE's requires sending large amounts of data between processors. An alternative ap- 
proach, and the one adopted by us, is to use real-space data decomposition. In this method, 
each PE is mapped to a specific region of space. The electron charge density, Hartree po- 
tential, and each wavefunction are distributed by regions over the processors. With this 
approach interprocessor communication is restricted to two areas: the computation of in- 
tegrals on the real space grid (See Eq. (§])), and the application of the finite-differencing 
operators. 

For integration, the ideal optimization strategy is to organize the calculation so that as 
many integrals as possible are computed at once. This can be understood by considering 
the time required for interprocessor communication, which consists of a latency period and 
a transfer phase. The latency period is significant and is the same whether 1 or 1000 words 
of data is transferred. Our integration procedure is as follows: First, calculate the intra- 
processor contributions to the integral (i.e., integrate over the subdomains); Second, store 
as many of these local integrals as possible; Finally, transfer them between processors in 
blocks and complete the integration by summing the local integrals. 

It was straightforward to implement the above procedure in most cases, but the orthog- 
onalization step required significant modifications. In a standard implementation of the 
Gram-Schmidt orthogonalization algorithm, wavefunction overlaps and updates are com- 
puted sequentially, and the integrals cannot be computed in parallel. To reduce the number 
of data transfers, the following implementation of Gram-Schmidt orthogonalization was 
adopted. First, the overlap matrix S^- = (ip^ipj) is computed as above: the local parts of 
the overlap integrals are computed and stored on each processor; the integration is then 
completed by transferring them in blocks to the other processors. Second, the Cholesky 
factorization^ of the overlap matrix is computed: Sij = (C^C)ij. The Cholesky factor, C, 
is relevant because its components are the overlaps between the new orthogonal wavefunc- 
tions and the original ones.0 Finally, the diagonal components of the Cholesky factor are 
used to normalize the wavef unctions, and the off-diagonal ones are used to complete the 
orthogonalization: 

27 



r = 7r(i-Er^) 

i = 1, • • • , iV siaies . (25) 

For simplicity, the Cholesky factorization of is currently performed on each processor. 
The computational time to factorize scales as {N^j), but has not yet become a bottleneck. 
However, for very large systems (greater than 800 orbitals), a parallelized Cholesky factor- 
ization will save significant computer time and memory. 

The second area where interprocessor communication is required is the finite differencing 
of the wavefunctions and Hartree potential, since finite differencing is nonlocal. However, 
in the Mehrstellen discretization, the nonlocality is restricted to points within one grid 
unit in each Cartesian direction. Interprocessor communication is thus always limited to 
nearest neighbor PE's regardless of the size of the system. This low communication cost is a 
particular advantage of a Mehrstellen type approach as opposed to a central finite-difference 
approach, which requires a higher degree of nonlocality to achieve the same level of accuracy. 

Load balancing and the efficient use of all PE's is a major concern for any parallel 
algorithm. With the method described above, the load balancing is essentially perfect for 
all parts of the calculation except for the application of the nonlocal pseudopotentials. These 
are applied to the wavefunctions in localized volumes around each ion. If the distribution 
of ions in space is nonuniform, then some of the PE's will be idle for a fraction of each SCF 
step. However, actual calculations on many systems have shown that the application of the 
nonlocal potentials typically requires less than 10% of the total computational time on any 
PE, so that processor utilization will always exceed 90%. 

The efficiency of the massively parallel implementation described here is illustrated in 
Fig. |], which shows the speedup in execution time per step for a given problem as the 
number of PE's is increased. The graph indicates a superlinear relationship, which is an 
artifact due to single processor cache effects. There are two competing factors here. The 
first is the increased communication cost as the number of processors increases, which tends 
to reduce the speedup. The second is the reduction in the amount of data stored on each 
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processor and a consequent increase in the number of cache hits. As was discussed earlier, 
the communication costs are relatively small with the data model being used; since the 
cache is relatively small, cache hit effects outweigh these. An apparent superlinear speedup 
is observed. 

VII. SUMMARY 

We have described the development of a multigrid-based method that uses a real-space 
grid as a basis. The multigrid techniques provide preconditioning and convergence acceler- 
ation at all length scales, and therefore lead to particularly efficient algorithms. A specific 
implementation of multigrid methodology in the context of density functional theory was 
described and illustrated with several applications. The salient points of our implementa- 
tion include: (i) the development of new compact discretization schemes in real space for 
systems with cubic, orthorhombic, and hexagonal symmetry, and (ii) the development of 
new multilevel algorithms for the iterative solution of Kohn-Sham and Poisson equations. 
The accuracy of the discretizations was tested by direct comparison with plane-wave cal- 
culations when possible and were found to be in excellent agreement in all cases. These 
algorithms are very suitable for use on massively parallel computers and in O(N) methods. 
We described an implementation on the Cray-T3D massively parallel computer that led to 
a linear speedup in the calculations with the number of processors. 

The above methodology was tested on a large number of systems. A prior Communication 
described tests on C 60 molecule, and diamond and GaN supercells. The present article 
examined convergence properties in detail for a supercell of disordered Si and the N impurity 
in diamond. The multigrid techniques increased the convergence rates by factors of 2 and 
4, respectively, when compared to the steepest descents algorithm. An extension to non- 
uniform grids that uses a separable coordinate transform to change grid resolution locally, 
e.g., at the surface or near an impurity, was developed and tested on the O interstitial in 
Si. This extension results in only minor changes in methodology and coding, while the 
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reduction in basis set size and thus in computational effort can be significant. A complex 
version of multigrid code, capable of an arbitrary sampling of the Brillouin zone, was also 
was developed and tested on bulk Al. 

Large time-step molecular dynamics simulations require very accurate forces, which can 
potentially lead to difficulties in real space methods as the atoms move relative to the grid 
points. We have described a set of techniques based on Fourier filtering of pseudopotentials 
that eliminate these difficulties for grid spacings of sizes similar to those used in plane- wave 
calculations. A 1 ps test simulation of bulk Si at 1100 K conserved the total energy to within 
27 /j,eV per atom, and illustrated the high quality of these forces. Further applications of 
this methodology are in progress, including a simulation of surface melting of Si,S structural 
properties of large biomolecules that contain over 400 atoms^ and electronic and structural 
properties of In x Gai_ x N quantum wells.il The multigrid methodology is also very suitable 
for O(N) implementations, and tests results for a 216-atom cell of bulk Si were described 
recently.il 
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APPENDIX A: ANALYSIS OF THE MEHRSTELLEN OPERATOR 

The Mehrstellen discretization differs from central finite-differencing in two important 
respects: first, higher accuracy in the discretization is achieved by using more local infor- 
mation, but this accuracy is fully realized only at convergence; and second, the discretized 
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Kohn-Sham eigenvalue equation Eq. fl5|) is non-hermitian because the operator B me/ v does 
not commute with the potential operator. In this appendix, we examine the accuracy of 
the Mehrstellen discretization, and prove that the non-hermitian nature of ttmehr does not 
change the nature of the wavef unctions: they remain orthogonal. For simplicity, we analyze 
only the Mehrstellen discretization of the orthorhombic lattice. 

The fourth-order Mehrstellen discretization (see Table |) samples the Hamilton and the 
wavefunction at 19 points 

3 

A mehr [f(x)] = a/(x) + & n/(x ± h n x n ) 

n=l 



± h m x m ) (Al) 

3 

B mehr [f{x)} = a' f{x) + J2 & U( X ± hn**)- ( A2 ) 



71=1 



The accuracy of the Mehrstellen discretization is one order higher than the corresponding 
central finite-differencing one, but this accuracy is achieved only at convergence .0 The small 
h expansions of the A mehr and B me/ir . demonstrate this principle: 

A mehr = -V 2 - i-V 2 £ h 2 n V 2 + O(h') (A3) 

i/ 71=1 

Braehr = I + ^ £ ^V 2 + 0(/> 4 ) (A4) 
12 71=1 

Note that by construction, A me hr = ~B me hr{— V 2 ) to 0(h 4 ). Thus, the Mehrstellen dis- 
cretization of the Kohn-Sham equations is equivalent to 

B me/ir [H KS Vn - e„^n] + 0(/i 4 ). (A5) 

The 0(h 2 ) terms, implicit in the right-hand side, vanish at convergence, when Hxs^n = 
e n ip n . A similar analysis applies to the discretization of the Poisson equation: 

A me hr [Vff] ~ B me fc r [47Tp] = 

B mehr [-V 2 V H - Airp] + 0(h 4 ). (A6) 
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Unlike a plane-wave or central finite-differencing representation of the Kohn-Sham equa- 
tions, the Mehrstellen discretization Eq. ([5]) leads to a non-hermitian, generalized eigenvalue 
equation. Nonetheless, we prove that the right eigenvectors of the discretized operator, i.e. 
the electronic orbitals, are orthogonal because they are also eigenvectors of a hermitian 
Hamiltonian. The generalized eigenvalue equation can be recast into a more familiar form 
by multiplication by B~g hr (The invertibility of B me f, r is discussed below): 

2( B mehr A mehr)lpn + V e fflf) n = 

-O0„ + V eff ifj n = e„Vv, (A7) 

where C is a non-compact discretization of —V 2 of the same order as A mehr . The solutions 
of this equation, the ip n and e n , are the solutions of the original equation. Because A me h r 
and Bmehr are finite-differencing operators with constant coefficients, they are translation- 
ally invariant and thus commute. They are also hermitian. Thus, C = (B^ e(!r A me)i) .) is 
hermitian, and the wavefunctions of Eq. (|5|) are orthogonal. 

Eq. (|5|) is the preferred discretization for computation, and the equivalent Eq. ( |A7| ) is of 
formal interest only because the operators B~g hr and hence C are long-ranged and therefore 
computationally expensive to apply. 

Finally, we consider the invertibility of the B me ^ r operator. We show that under rea- 
sonable conditions B me /j r has no zero eigenvalues (in fact, it is a positive definite opera- 
tor) by arguing that its null space is empty. It is straightforward to show that the null 
space of ~B m ehr is comprised only of plane waves of maximum kinetic energy; namely, 
ipnuii(x,y,z) = e -M*/hi+y/h 2 +z/h 3 ) ( Qr y _^ etc ). gee Eq> below. Thus, the null 

space of B me /j r is empty whenever these plane waves cannot be represented on the real-space 
mesh. 

This condition can be realized in two ways: choice of grid size, or explicit projection. For 
periodic boundary conditions, when one or more of the linear dimensions N x , N y , or N z is 
odd, the maximum g- vector along that dimension is ir/hi(N x — 1)/N X < ir/hi. Second, if the 
grid discretization cannot be chosen to meet the formal invertibility condition, the pseudo 
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inversea of ~B m ehr exists and can be used; that is, the few vectors in the null space of Ti me hr 
are projected out from the wavef unctions. On physical grounds any orbital of such rapid 
variation should be excluded from the calculation because it is marginally representable on 
the mesh. The pseudo inverse of B me hr is 

B'L(x)= E eS/B mehr (g), (A8) 

null 

where the discrete Fourier transform of B me /, r is 

B(g) = ^Ecoste) 2 /3. (A9) 

APPENDIX B: FOURIER FILTERING OF PSEUDOPOTENTIALS 

The pseudopotentials are short-ranged: the Coulomb tail of the local potential is ex- 
plicitly canceled and added to the Madelung summation of the electrostatic energy, and by 
construction the nonlocal projectors have no Coulomb tail. The nonlocal projectors and 
short-ranged local pseudopotentials are Fourier filtered only once, when the appropriate po- 
tentials and grid spacing are selected. The filtering procedure attenuates the high-frequency 
components while maintaining the localization of the projectors and potentials. 

The unfiltered potentials or projectors are defined on a real-space radial grid, and are 
transformed to momentum space in order to filter the high-frequency components: 

V l>m ered(G) = F fllter (G/G cut ) J V^j^GrY dr, (Bl) 

where the cut-off function Ffu ter (G/G cut ) smoothly attenuates the radial Fourier transform 
beyond G > G cut . The cut-off wave vector is determined by the grid spacing: G cut = 
an/hgrid. The cut-off function is unity for G < G cut and equals e~^ G ^ Gcut ~ 1 ^ 2 for G > G cut . 
The parameters a and f3\ depend on the atomic species and are carefully adjusted to achieve 
the best results. 

After the momentum-space filtering, the back-transformed potentials and projectors will 
extend beyond the original core radius. For computational efficiency, it is important that 
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the nonlocal pseudopotentials be short-ranged. Accordingly, a second filtering in real space 
is applied to reduce the large-radius oscillations beyond an empirically-determined radius 
r cut . The second filtering function is unity below the cut-off radius and equals e _/32 ( r//rci "~ 1 ) 2 
above it. Example values for a carbon generalized norm-conserving pseudopotential with s 
and p nonlocalities are a = 4/7 and f3\ = 18, r cut = 2.5 bohr, and = 0.4. 

Since the filtering procedure modifies the pseudopotentials, it is necessary to determine 
whether the modified potentials affect the system's physical properties. Because the degree 
of filtering is set by the real-space grid spacing h gr id, the effect is similar to performing an 
under-converged plane-wave calculation. The last effects are well understood and can be 
measured quantitatively by progressively increasing the plane-wave cut-off. In particular, 
the main results of plane-wave calculations remain valid, even if they are significantly un- 
derconverged. This is due to the uniform convergence properties of plane waves, which form 
a translationally invariant basis set. Similarly, the convergence effects may be monitored for 
a real-space calculation by decreasing the grid spacing. In our tests we found that the total 
energy of the system converges to an asymptotic value in a manner similar to that observed 
with plane waves. 

APPENDIX C: HEXAGONAL DISCRETIZATION OF THE KOHN-SHAM 

EQUATIONS 

The hexagonal grid structure described in Eq. (|TI|) is a simple hexagonal lattice. Because 
the z axis is orthogonal to the xy plane, the —V 2 operator may be written in separable form 

- v 2 = -V 2 xy - V 2 . (CI) 

Along the z direction a sixth-order central finite-difference operator was selected 

-V 2 J(i,j,k)= £ a n f(i,j,k + n) + 0(h 6 z ), (C2) 

n=— 3 

where a_ n = a n and the a n are given in Table [TT]. For the xy plane the the lattice transla- 
tional vectors are not orthogonal, and a central finite-difference expression is not applicable. 
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Instead a composite form was selected 

-V 2 xy f(i,j,k)= Pn\f(i + n,j,k) + f(i,j + n,k) 

n=— 3 

+f(i + n,j-n,k))+0(h%), (C3) 

where /3_ n = /3 n and the /3 n are given in Table [TT]. 

In the multigrid solution process these sixth-order operators are only used on the finest 
grid level to compute the kinetic energy and the residual On coarser grid levels, a second- 
order operator is used to represent —V 2 ; viz., 

-Vj(iJ,k) = £ a'J(hj,k + n) + 0(hl), (C4) 

n=-l 

and 

-V 2 xy f(t,j,k)= £ f3' n [f(i + n,j,k) + f(i,j + n,k) 

n=-l 

+ f( i + n ,j-n,k)} + 0(h 2 xy ), (C5) 

where the discretization weights are listed in Table O. 

The multigrid restriction operator uses a volume weighting scheme with the weights 
adjusted for the hexagonal grid, and similarly, the hexagonal generalization of tri-linear 
interpolation is used to transfer the coarse-grid correction to the fine grid. The wavefunctions 
and Hartree potential are generated using multigrid iterations in exactly the same manner 
as was described in Section III except for the modifications described here. 
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FIGURES 

FIG. 1. Convergence rates for a disordered 64-atom Si cell at a 12 Ry equivalent cut-off. The 
convergence rate, log 10 (-E — Eq), is plotted against the number of self-consistent field (SCF) steps. 
Random initial wavefunctions were used with a constant initial density. The initial ionic positions 
were obtained from an equilibrated molecular dynamics simulation at 1000 K. SD represents con- 
vergence rates for the steepest descents algorithm, MG is for multigrid, SD-SD is steepest descents 
with subspace diagonalizations, and MG-SD is multigrid with subspace diagonalizations. 

FIG. 2. Convergence rates for a 64-atom diamond cell with a substitutional N impurity at a 
63 Ry equivalent cut-off. The convergence rate, log 10 (E — Eq), is plotted against the number of 
self-consistent field (SCF) steps. Random initial wavefunctions were used with a constant initial 
density. The notation is the same as in Fig. 1. 

FIG. 3. The potential, kinetic, and total energy of a molecular dynamics simulation of a 64 
atom silicon cell at a temperature of 1100 K. Third-order Beeman-Verlet integration with an 80 
a.u. time step was used for the integration of the ionic equations of motion. The total energy curve 
is multiplied by a factor of 100. The potential and total energies have been shifted by 251.171 a.u. 
so that they could appear together. 

FIG. 4. Speedup in execution time is plotted vs. number of processors for a massively parallel 
implementation of the code on a Cray-T3D. The test system is a 64 atom cell of GaN at a fixed, 
70 Ry equivalent cut-off. The solid line is a guide to the eye. 
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TABLES 

TABLE I. Discretization weights for the fourth-order orthorhombic Mehrstellen operators for 
the central, nearest-neighbor, and next nearest-neighbor grid points. See Eqs. (A12) and (A2) for 
the definitions of a, b n , and c njjn . The cubic-grid operator corresponds to hi = h gri( i. 
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K 




Cn,m 


^mehr 
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12 




1 1 



TABLE II. 


Discretization weights, 


listed by distance along basis set 


axes, for the sixth and 


second-order kinetic energy operators for the hexagonal grid. See also Eqe 


>. C2-C5. 
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Fig. 1 Briggs et al. 
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